Big Batch SGD: Automated Inference using Adaptive Batch Sizes
Authors
Abstract
Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients make it difficult to use them for adaptive stepsize selection and automatic stopping. We propose alternative “big batch” SGD schemes that adaptively grow the batch size over time to maintain a nearly constant signal-to-noise ratio in the gradient approximation. The resulting methods have similar convergence rates to classical SGD methods without requiring convexity of the objective function. The high-fidelity gradients enable automated learning rate selection and do not require stepsize decay. For this reason, big batch methods are easily automated and can run with little or no user oversight.
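Since the abstract describes the mechanism but no pseudocode appears on this page, here is a minimal Python sketch of a big-batch style loop under assumptions of our own: the variance test with threshold theta, the batch-doubling rule, and the Armijo backtracking constants are illustrative choices, not the authors' exact algorithm.

import numpy as np

def big_batch_sgd(loss_fn, grad_fn, x0, data, theta=0.5, b0=32,
                  lr0=1.0, shrink=0.5, max_iters=500, tol=1e-8):
    # Sketch only: grow the batch whenever the estimated noise in the
    # batch gradient dominates its squared norm, so the estimate keeps a
    # roughly constant signal-to-noise ratio; then trust that gradient
    # for a backtracking step and for the stopping test.
    rng = np.random.default_rng(0)
    x, b, n = np.asarray(x0, dtype=float), b0, len(data)
    for _ in range(max_iters):
        idx = rng.choice(n, size=min(b, n), replace=False)
        grads = np.stack([grad_fn(x, data[i]) for i in idx])
        g = grads.mean(axis=0)                      # signal: batch gradient
        noise = grads.var(axis=0).sum() / len(idx)  # variance of that mean
        if noise > theta ** 2 * (g @ g) and b < n:  # too noisy: grow batch, retry
            b = min(2 * b, n)
            continue
        if g @ g < tol:                             # gradient reliably small: stop
            break
        batch_loss = lambda z: np.mean([loss_fn(z, data[i]) for i in idx])
        step, f0 = lr0, batch_loss(x)
        while batch_loss(x - step * g) > f0 - 1e-4 * step * (g @ g):
            step *= shrink                          # backtracking (Armijo) line search
        x = x - step * g
    return x

# Toy usage: noisy least-squares regression.
rng = np.random.default_rng(1)
A, w_true = rng.normal(size=(2000, 5)), rng.normal(size=5)
y = A @ w_true + 0.1 * rng.normal(size=2000)
data = list(zip(A, y))
loss = lambda w, d: 0.5 * (d[0] @ w - d[1]) ** 2
grad = lambda w, d: (d[0] @ w - d[1]) * d[0]
w_hat = big_batch_sgd(loss, grad, np.zeros(5), data)

The stopping test uses the same big-batch gradient as the step, which is the point of keeping its signal-to-noise ratio controlled.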
Similar resources
Automated Inference with Adaptive Batches
Classical stochastic gradient methods for optimization rely on noisy gradient approximations that become progressively less accurate as iterates approach a solution. The large noise and small signal in the resulting gradients make it difficult to use them for adaptive stepsize selection and automatic stopping. We propose alternative “big batch” SGD schemes that adaptively grow the batch size ove...
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
Stochastic Gradient Descent (SGD) with small mini-batch is a key component in modern large-scale machine learning. However, its efficiency has not been easy to analyze as most theoretical results require adaptive rates and show convergence rates far slower than that for gradient descent, making computational comparisons difficult. In this paper we aim to clarify the issue of fast SGD convergenc...
AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks
Training deep neural networks with Stochastic Gradient Descent, or its variants, requires careful choice of both learning rate and batch size. While smaller batch sizes generally converge in fewer training epochs, larger batch sizes offer more parallelism and hence better computational efficiency. We have developed a new training approach that, rather than statically choosing a single batch siz...
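The excerpt above cuts off before the paper's actual rule, so purely as an illustration of trading learning-rate decay for batch growth (all names and constants below are hypothetical, not AdaBatch's schedule), such a rule might look like:

def adaptive_batch_schedule(epoch, base_batch=128, base_lr=0.1,
                            double_every=20, max_batch=8192):
    # Hypothetical schedule: instead of decaying the learning rate,
    # double the batch size every `double_every` epochs (capped at
    # max_batch), reducing gradient noise while exposing more
    # parallelism per step.
    doublings = epoch // double_every
    batch = min(base_batch * (2 ** doublings), max_batch)
    return batch, base_lr

for epoch in (0, 19, 20, 40, 100):
    print(epoch, adaptive_batch_schedule(epoch))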
Rectified linear neural networks with tied-scalar regularization for LVCSR
It is known that rectified linear deep neural networks (RL-DNNs) can consistently outperform the conventional pretrained sigmoid DNNs even with a random initialization. In this paper, we present another interesting and useful property of RL-DNNs: we can learn RL-DNNs with a very large batch size in stochastic gradient descent (SGD). Therefore, the SGD learning can be easily parallelized amon...
Removing Noise in On-Line Search using Adaptive Batch Sizes
Stochastic (on-line) learning can be faster than batch learning. However, at late times, the learning rate must be annealed to remove the noise present in the stochastic weight updates. In this annealing phase, the convergence rate (in mean square) is at best proportional to l/T where T is the number of input presentations. An alternative is to increase the batch size to remove the noise. In th...
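A back-of-the-envelope calculation (ours, not the paper's analysis) shows why growing the batch can substitute for annealing: averaging B independent per-example gradients divides the gradient noise by B, so letting B_t grow like t damps the noise at the same 1/T rate quoted above. In LaTeX, with mu a strong-convexity constant, eta a fixed step size, and sigma^2 the per-example gradient variance (constants omitted):

\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Standard mini-batch noise model: averaging B_t per-example gradients
% divides the variance by B_t; the contraction bound is stated only up
% to constants (hence \lesssim).
\[
  \operatorname{Var}\!\left[\hat g_{B_t}\right] = \frac{\sigma^{2}}{B_t},
  \qquad
  \mathbb{E}\,\lVert w_{t+1}-w^{\star}\rVert^{2}
  \;\lesssim\;
  (1-\mu\eta)\,\mathbb{E}\,\lVert w_{t}-w^{\star}\rVert^{2}
  + \eta^{2}\,\frac{\sigma^{2}}{B_{t}}
\]
% With eta held fixed, taking B_t proportional to t shrinks the noise
% term at the same 1/T rate that annealing eta_t ~ 1/t achieves.
\end{document}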
Journal: CoRR
Volume: abs/1610.05792
Pages: -
Publication year: 2016